Finding Domain Terms using Wikipedia
Abstract
In this paper we present a new approach for obtaining the terminology of a given domain using the category and page structures of Wikipedia in a language-independent way. The idea is to exploit the Wikipedia category graph, starting from a top category that we identify with the name of the domain. After obtaining the full set of categories belonging to the selected domain, the collection of corresponding pages is extracted, subject to some constraints. To reduce noise, a bootstrapping approach involving several iterations is used. At each iteration the less reliable pages, judged by the balance between the on-domain and off-domain categories of each page, are removed, as are the less reliable categories. The set of recovered pages and categories is selected as the initial domain term vocabulary. This approach has been applied to three broad-coverage domains (astronomy, chemistry and medicine) and two languages (English and Spanish), showing promising performance. The resulting set of terms has been evaluated using as reference those terms occurring in WordNet (using Magnini's domain codes) and those appearing in SNOMED-CT (a reference resource for the medical domain available for Spanish).
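The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the function names, the 0.5 score threshold, and the rule that a category is dropped once none of its pages survive are all assumptions made for illustration, since the abstract does not specify the constraints or the reliability measure.

```python
from collections import deque

# Toy data (hypothetical): category -> subcategories, page -> its categories.
SUBCATS = {
    "Astronomy": ["Stars", "Observatories"],
    "Observatories": ["Buildings in Chile"],
}
PAGE_CATS = {
    "Supernova": ["Stars", "Astronomy"],
    "Red giant": ["Stars"],
    "Paranal Observatory": ["Observatories"],
    "Santiago Cathedral": ["Buildings in Chile", "Churches", "Tourism"],
}

def collect_categories(subcats, top):
    """Breadth-first walk of the category graph, starting at the top category."""
    seen = {top}
    queue = deque([top])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            if child not in seen:  # the category graph may contain cycles
                seen.add(child)
                queue.append(child)
    return seen

def page_score(page_cats, page, domain_cats):
    """Fraction of a page's categories that are on-domain (assumed measure)."""
    cats = page_cats.get(page, ())
    if not cats:
        return 0.0
    return sum(c in domain_cats for c in cats) / len(cats)

def bootstrap(subcats, page_cats, top, threshold=0.5, max_iter=10):
    """Iteratively drop low-scoring pages and categories left without pages."""
    domain_cats = collect_categories(subcats, top)
    pages = {p for p, cats in page_cats.items()
             if any(c in domain_cats for c in cats)}
    for _ in range(max_iter):
        kept_pages = {p for p in pages
                      if page_score(page_cats, p, domain_cats) >= threshold}
        # Assumed pruning rule: keep the top category and any category
        # that still has at least one surviving page.
        kept_cats = {c for c in domain_cats
                     if c == top or any(c in page_cats[p] for p in kept_pages)}
        if kept_pages == pages and kept_cats == domain_cats:
            break  # nothing changed, so the sets have stabilised
        pages, domain_cats = kept_pages, kept_cats
    return pages, domain_cats

pages, cats = bootstrap(SUBCATS, PAGE_CATS, "Astronomy")
print(sorted(pages))
print(sorted(cats))
```

In this toy graph, "Santiago Cathedral" enters the candidate set only through "Buildings in Chile"; with one on-domain category out of three it falls below the threshold and is removed, after which "Buildings in Chile" has no surviving pages and is pruned as well.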
Similar resources
Using Wikipedia for Domain Terms Extraction
Domain terms are a useful resource for tuning both resources and NLP processors to domain specific tasks. This paper proposes a method for obtaining terms from potentially any domain using Wikipedia.
Using Domain-specific and Collaborative Resources for Term Translation
In this article we investigate the translation of terms from English into German and vice versa in the isolation of an ontology vocabulary. For this study we built new domain-specific resources from the translation search engine Linguee and from the online encyclopedia Wikipedia. We learned that a domain-specific resource produces better results than a bigger, but more general one. The first find...
Harvesting Domain-Specific Terms using Wikipedia
We present a simple but effective method of automatically extracting domain-specific terms using Wikipedia as training data (i.e. self-supervised learning). Our first goal is to show, using human judgments, that Wikipedia categories are domain-specific and thus can replace manually annotated terms. Second, we show that identifying such terms using harvested Wikipedia categories and entities as s...
Computing Semantic Relatedness using DBPedia
Extracting the semantic relatedness of terms is an important topic in several areas, including data mining, information retrieval and web recommendation. This paper presents an approach for computing the semantic relatedness of terms using the knowledge base of DBpedia, a community effort to extract structured information from Wikipedia. Several approaches to extract semantic relatedness from ...
Using Wikipedia to translate domain-specific terms in SMT
When building a university lecture translation system, one important step is to adapt it to the target domain. One problem in this adaptation task is to acquire translations for domain specific terms. In this approach we tried to get these translations from Wikipedia, which provides articles on very specific topics in many different languages. To extract translations for the domain specific ter...
Publication date: 2010